MR-KPA: medication recommendation by combining knowledge-enhanced pre-training with a deep adversarial network

Background Medication recommendation based on electronic medical record (EMR) is a research hot spot in smart healthcare. For developing computational medication recommendation methods based on EMR, an important challenge is the lack of a large number of longitudinal EMR data with time correlation. Faced with this challenge, this paper proposes a new EMR-based medication recommendation model called MR-KPA, which combines knowledge-enhanced pre-training with the deep adversarial network to improve medication recommendation from both feature representation and the fine-tuning process. Firstly, a knowledge-enhanced pre-training visit model is proposed to realize domain knowledge-based external feature fusion and pre-training-based internal feature mining for improving the feature representation. Secondly, a medication recommendation model based on the deep adversarial network is developed to optimize the fine-tuning process of pre-training visit model and alleviate over-fitting of model caused by the task gap between pre-training and recommendation. Result The experimental results on EMRs from medical and health institutions in Hainan Province, China show that the proposed MR-KPA model can effectively improve the accuracy of medication recommendation on small-scale longitudinal EMR data compared with existing representative methods. Conclusion The advantages of the proposed MR-KPA are mainly attributed to knowledge enhancement based on ontology embedding, the pre-training visit model and adversarial training. Each of these three optimizations is very effective for improving the capability of medication recommendation on small-scale longitudinal EMR data, and the pre-training visit model has the most significant improvement effect. These three optimizations are also complementary, and their integration makes the proposed MR-KPA model achieve the best recommendation effect.


Introduction
Electronic medical records (EMRs) represent a patient's historical visit sequence, where each record in the sequence contains a series of clinical events (diagnosis, procedure, medication, etc.) for a single admission. More and more attention has been paid to EMR-based auxiliary diagnosis and treatment, such as clinical knowledge question answering [1,2], health risk warning [3][4][5][6], auxiliary diagnosis [7,8] and electronic prescription recommendation [9,10]. Medication recommendation is an important research direction in EMR-based applications. Given a patient's current clinical events and history of visits, the goal of the medication recommendation task is to provide a personalized combination of medications appropriate to his or her health status. It is a crucial data mining task for an intelligent healthcare system [11], and many important recommendation models have been developed [12][13][14][15][16].
Existing EMR-based medication recommendation methods are mainly data-driven and adopt machine learning methods, especially deep networks, to model various clinical event sequences. In order to improve the accuracy of recommendation, related studies mainly adopted longitudinal sequential recommendation methods, which integrate the patient's current health conditions and historical visit information to leverage the temporal dependencies among clinical events for medication recommendation [13,17]. Recent studies focused on developing novel and complex neural networks that capture deep-level data features, including complete structure information [11], drug-drug interactions [12], multiple-level importance [18], relationships between historical and current diagnoses [19], and irregular time-series dependencies [20], to improve recommendation capabilities.
However, some diseases may require multiple follow-up visits while others do not. Patients may also visit a different hospital each time, resulting in incomplete multiple-visit records. As a result, patients' longitudinal EMR data with multiple visits are relatively scarce. For example, in our experiment we collected a total of 151,908 EMRs, but only 10,448 of them involved multiple visits, so the longitudinal data account for only 6.9% of the total. Such data are often discontinuous and can lead to information bias in research [21]. The lack of longitudinal data has become an important challenge for EMR-based medication recommendation.
Few-shot learning, which uses small sample data for effective model training, is a current research hot spot. Related methods are usually divided into three categories: fine-tuning, data enhancement and migration [22]. Data enhancement methods [23] usually need high-quality domain knowledge bases and easily introduce noise. Migration methods [24] need a group of labeled data from similar fields for transfer learning. Hence, fine-tuning methods [25], especially pre-training [26], have become the main means for few-shot learning of EMR-based models. At present, pre-training on EMRs or EHRs is attracting attention [27][28][29]. However, existing EMR pre-training methods need a large amount of unlabeled data from the same source as the labeled data, and neglect the optimization of the fine-tuning process. They also focus only on disease prediction tasks, whose number of classes is far lower than that of medication recommendation tasks. Therefore, these existing EMR pre-training methods cannot be used directly to solve the problem of lacking longitudinal data in EMR-based medication recommendation.
Based on the above observations and our previous study [30], this paper proposes an MR-KPA model which combines knowledge-enhanced pre-training with a deep adversarial network to realize medication recommendation based on small-scale longitudinal EMR data. The main contributions can be summarized as follows:
• Firstly, a knowledge-enhanced pre-training visit model is proposed to realize domain knowledge-based external feature fusion and pre-training-based internal feature mining for improving medication recommendation on small-scale longitudinal EMR data. Different from existing EMR pre-training methods, this visit model uses a large number of single-visit EMR data for pre-training, in order to avoid splitting longitudinal EMR data that are already insufficient.
• Secondly, a medication recommendation model based on the deep adversarial network is developed to apply EMR pre-training to medication recommendation for the first time. By introducing adversarial training, the fine-tuning process of the pre-training visit model can be optimized to alleviate the over-fitting caused by the task gap between pre-training and recommendation.
• Finally, a group of experiments has been performed on real EMR data from medical and health institutions in Hainan Province, China. Experimental results show that the proposed method can effectively improve the accuracy of medication recommendation based on small-scale longitudinal EMR data.
The rest of this paper is organized as follows. "Related work" section introduces related work. "Medical codes and data sets" section describes medical codes and data sets. "Method" section introduces the proposed MR-KPA model. In "Experiment" and "Discussion" sections, the predictive performance of this model is compared and analyzed with baselines and variants. Finally, "Conclusion" section gives the conclusions and future work.

Related work
Leveraging recommendation algorithms [31,32] to recommend rational and effective medications in time for patients, as a paramount recommendation task in the health domain, has been widely researched [11]. Existing methods are mainly data-driven and depend on large amounts of EMR data. Early approaches often adopted instance-based methods, which only focused on current health conditions and failed to make full use of historical information. Syed-Abdul et al. [33] proposed a smart medication recommendation model for electronic prescriptions. In order to reduce the probability of illegal prescriptions, this model adopted association rule mining to find the relationship between two labels. Zhang et al. [34] proposed the LEAP model to predict combinations of medicines given a patient's diagnoses. LEAP is a variant of the sequence-to-sequence model based on a content-attention mechanism and focuses on modeling the mappings between instances and label dependencies.
Patients' historical EMR data can clearly help medication recommendation. At present, studies on EMR-based medication recommendation mainly adopt longitudinal sequential recommendation methods, which recommend medications based on both current health conditions and historical information [12,17]. Choi et al. used a two-level neural attention model to detect influential past visits and significant clinical variables within those visits for improved medication recommendation [17]. An et al. proposed a relational perception LSTM (R-LSTM) to deal with the relationship between diseases and medications in longitudinal medical records, which can better integrate historical information into the medication-level patient representation [13]. Wang et al. proposed the adversarially regularized model for medication recommendation (ARMR), which built a key-value memory network based on information from historical admissions and carried out multi-hop reading on the memory network to recommend medications [12]. An et al. proposed a multilevel selective and interactive network (MeSIN), which fully leveraged the inherent multilevel structure of EHR data to learn a comprehensive patient representation for reasonable medication recommendation [11]. Table 1 gives a comparison of the above EMR-based medication recommendation methods. As shown in this table, existing studies on longitudinal sequential medication recommendation mainly focused on developing different deep neural networks to capture deep-level features in EMR data. Such approaches depend on massive longitudinal EMR data. Therefore, the lack of longitudinal EMR data has become an important challenge for EMR-based medication recommendation. At present, medication recommendation based on relatively small-scale longitudinal EMR data has not been given enough attention. Studies on few-shot learning of EMR-based models mainly focused on pre-training on EMR or EHR data for disease prediction tasks.
Various EMR or EHR pre-training tasks are designed to learn feature expression from large-scale unlabeled data through self-supervised learning [26]. For example, Rasmy et al. [27] proposed Med-BERT, which adapted the BERT framework originally developed for the text domain to the structured EHR domain. Fine-tuning experiments on two clinical databases showed that Med-BERT can benefit disease prediction studies with small local training datasets, reduce data collection expenses, and accelerate the pace of artificial intelligence aided healthcare. Ren et al. [28] proposed a novel model RAPT, which stands for representation by Pre-training time-aware Transformer, and devised three pre-training tasks to handle data insufficiency, data incompleteness and short sequence problems. Extensive experimental results on four downstream tasks showed the effectiveness of the proposed approach. Meng et al. [29] presented a temporal deep learning model that performs bidirectional representation learning on EHR sequences with a transformer architecture and the pre-training task of masked language modeling to predict a future diagnosis of depression. However, these EMR pre-training methods cannot be used directly to solve the problem of lacking longitudinal EMR data in EMR-based medication recommendation:
• In data, existing EMR pre-training methods rely on a large amount of unlabeled data from the same source as the labeled data. The studies above usually split the experimental data and use most of them for pre-training. This way of obtaining pre-training data is not applicable to longitudinal EMR data, which are scarce in themselves.
• In the downstream task, existing EMR pre-training methods mainly aim at disease prediction, which is usually a binary classification problem. In contrast, there are often hundreds of classes in medication recommendation. Therefore, the application of EMR pre-training to medication recommendation should be studied separately.
• In the fine-tuning process, existing EMR pre-training methods focus on pre-training tasks and neglect the optimization of the fine-tuning process. However, the gap between pre-training and downstream tasks can cause the catastrophic forgetting problem [35,36]. As the number of fine-tuning iterations increases, the downstream task increasingly focuses on the labelled data, which leads to over-fitting. Therefore, it is necessary to improve the downstream models to optimize the fine-tuning process of the pre-training model.
In addition, the fusion of knowledge and big data is a recent research hot spot. Integrating formal domain knowledge, such as term ontologies [37,38] and knowledge graphs (KG) [39,40], into deep neural networks has become an important approach to improving feature expression in various applications of deep learning, such as finance [41] and medicine [42]. For EMR-based medication recommendation, fusing domain knowledge to improve the feature expression of EMRs has also received attention. For example, Choi et al. represented each medical concept as a combination of its ancestors in the medical ontology using an attention mechanism, in order to enrich the input of EMR-based medication recommendation [17]. However, their study still depended only on longitudinal EMR data. Though medical concepts enriched the feature expression of EMRs, model training still needed a large amount of EMR data. The training data in Choi et al.'s study included three data sets, Sutter PAMF, MIMIC-III and the Sutter heart failure (HF) cohort, in which the numbers of visit records were 13,920,759, 19,911 and 572,551 respectively.
In order to improve the robustness and interpretability of models, knowledge enhanced pre-training models (KEPTMs) are attracting attention. Yang et al. [43] categorized existing KEPTMs into three groups: entity enhanced pre-trained models [44,45], triplet enhanced pre-trained models [46,47] and other knowledge enhanced pre-trained models [48,49]. However, all of these KEPTMs were oriented to text corpora. Shang et al. [16] proposed G-Bert, which modified the BERT pre-training tasks to realize knowledge-enhanced pre-training on large-scale single-visit EMR data, but G-Bert only considered two types of medical codes, and its pre-training tasks only focused on the medical codes themselves and the relations between them. Other important information, especially symptoms, and its predictive value for medication recommendation were not considered in the pre-training tasks. Moreover, that work also neglected the gap between pre-training and downstream tasks, which is particularly serious when the labelled data are much smaller than the unlabeled pre-training data. This is indeed the case in our data: as stated above, longitudinal EMR data account for only 6.9% of the total and the remaining 93.1% are single-visit data. Therefore, it is necessary to improve the recommendation model to optimize the fine-tuning process of the single-visit pre-training model.
Based on the above analysis, we propose the MR-KPA model which combines knowledge-enhanced pre-training with a deep adversarial network to improve medication recommendation from both feature expression and recommendation model structure, for realizing medication recommendation based on small-scale longitudinal EMR data. The details are introduced in the following sections.

Medical codes
Medical codes are usually categorized according to a tree-structured classification system for diagnoses and drugs. Figure 1 gives the tree structures of the ICD-10 ontology and the NDC ontology. All codes are the lowest leaf nodes. The left of Fig. 1 is an example of ICD-10: J98.4 is the ICD-10 code of "other lung diseases" and its sibling node J98.1 is the ICD-10 code of "Pulmonary collapse". They have a common parent node J98. This means that both diseases belong to "other respiratory diseases", whose ICD-10 code is J98.
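The parent and sibling relation described above can be illustrated with a minimal sketch. It assumes that codes are written in dotted form and that the parent category is simply the part before the dot; the function names `icd10_parent` and `are_siblings` are hypothetical, and a real ICD-10 ontology has more levels than this one-step lookup.

```python
def icd10_parent(code: str) -> str:
    """Return the category of a dotted ICD-10 code, e.g. 'J98.4' -> 'J98'.

    Assumption: the parent is the part before the dot (one level only).
    """
    return code.split(".", 1)[0]


def are_siblings(code_a: str, code_b: str) -> bool:
    """Two distinct leaf codes are siblings when they share the same parent."""
    return code_a != code_b and icd10_parent(code_a) == icd10_parent(code_b)


if __name__ == "__main__":
    print(icd10_parent("J98.4"))           # parent category of the leaf code
    print(are_siblings("J98.4", "J98.1"))  # both leaves sit under J98
```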
The right of Fig. 1 is an example of NDC (Chinese National Drug Code). 86900450000011 (86, 9, 00450, 00001, 1) is the NDC code of "Ceftazidime for Injection". Codes in line with the Chinese national drug coding standard have 14 digits. The first 2 digits "86" are the drug country code and the third digit "9" represents the drug category code. The fourth to eighth digits "00450" represent the enterprise identifier and the ninth to thirteenth digits "00001" represent the product identifier. The last digit "1" represents different drugs.
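The field layout described above can be sketched as a small parser. This is only an illustration of the 2 + 1 + 5 + 5 + 1 digit split given in the text; the function name `parse_drug_code` and the field names are hypothetical and not part of any official library.

```python
from typing import Dict

# (field name, width in digits), following the layout described in the text.
FIELDS = (
    ("country", 2),
    ("category", 1),
    ("enterprise", 5),
    ("product", 5),
    ("last_digit", 1),
)


def parse_drug_code(code: str) -> Dict[str, str]:
    """Split a 14-digit drug code into its fields; spaces are ignored."""
    digits = code.replace(" ", "")
    if len(digits) != 14 or not digits.isdigit():
        raise ValueError("expected a 14-digit code, got %r" % code)
    parsed = {}
    position = 0
    for name, width in FIELDS:
        parsed[name] = digits[position:position + width]
        position += width
    return parsed


if __name__ == "__main__":
    print(parse_drug_code("86900450000011"))
```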

Data sets
In this study, the real EMRs are from medical and health institutions in Hainan Province, China. OUTPATIENT DIAG CODE records the ICD-10 codes of diagnoses, DRUG STANDARD CODE records the NDC codes of drugs and CHIEF COMPLAINTS records the patient's current symptoms. This study uses word segmentation to divide each symptom description sentence into words and removes stop words to create the symptom set of each EMR. Table 2 gives the data statistics. The single-visit records were used for training the pre-training model and the multiple-visit records were used for training and testing the prediction model. Compared with the data sizes in Table 1, our data set is very small.
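The data preparation described above can be sketched as follows. The sketch assumes that records arrive as `(patient_id, visit_record)` pairs and that symptom sentences have already been segmented into tokens; the helper names `split_by_visit_count` and `symptom_set` are hypothetical, and the segmentation tool itself is not shown because the text does not name one.

```python
from collections import defaultdict


def split_by_visit_count(records):
    """Group (patient_id, visit_record) pairs and separate single-visit
    patients (pre-training data) from multiple-visit patients
    (training and testing data for the prediction model)."""
    by_patient = defaultdict(list)
    for patient_id, visit in records:
        by_patient[patient_id].append(visit)
    single = {p: v for p, v in by_patient.items() if len(v) == 1}
    multiple = {p: v for p, v in by_patient.items() if len(v) > 1}
    return single, multiple


def symptom_set(tokens, stop_words):
    """Build the symptom set of one EMR from already-segmented tokens."""
    return {token for token in tokens if token not in stop_words}


if __name__ == "__main__":
    records = [("p1", "visit-a"), ("p2", "visit-b"), ("p2", "visit-c")]
    single, multiple = split_by_visit_count(records)
    print(sorted(single), sorted(multiple))
    print(symptom_set(["fever", "and", "cough"], {"and"}))
```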

An overview
A longitudinal sequential medication recommendation task can be defined as follows.
Definition 1: Longitudinal EMR Data. The longitudinal EMR data of a patient are a sequence of visit records

$$S^{(n)} = \{P_1^{(n)}, P_2^{(n)}, \ldots, P_{T^{(n)}}^{(n)}\},$$

where $n$ represents the $n$-th patient and $T^{(n)}$ is the number of visits of the $n$-th patient. The EMR record of the $t$-th visit is described as

$$P_t^{(n)} = d_t^{(n)} \cup m_t^{(n)} \cup s_t^{(n)},$$

where $d_t^{(n)}$ is a collection of ICD-10 diagnostic codes, $m_t^{(n)}$ is a collection of drug codes from the National Drug Codes (NDC), and $s_t^{(n)}$ is the collection of self-reported symptoms, named SYM, such as "fever".
Definition 2: Longitudinal Sequential Medication Recommendation Task. Given the $n$-th patient's historical EMR records $S_t^{(n)}$, together with the diagnosis codes $d_t^{(n)}$ and symptoms $s_t^{(n)}$ at the $t$-th visit, we want to recommend the drugs at the $t$-th visit by generating a multi-label output $\hat{y}_t \in \{0, 1\}^{ML}$, where $ML$ represents the number of drug codes. That is to say, the output of the medication recommendation is a list of appropriate drugs, and the recommendation problem is transformed into a multi-label classification problem.
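The multi-label formulation in Definition 2 can be made concrete with a short sketch: a set of drug codes is encoded as a 0/1 vector over a fixed drug vocabulary and decoded back into a list of drugs. The vocabulary ordering and the function names `multi_hot` and `decode` are assumptions made for illustration only.

```python
def multi_hot(drug_codes, vocabulary):
    """Encode a collection of drug codes as a 0/1 vector over `vocabulary`."""
    index = {code: i for i, code in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for code in drug_codes:
        vector[index[code]] = 1
    return vector


def decode(vector, vocabulary):
    """Turn a 0/1 vector back into the list of recommended drug codes."""
    return [code for code, flag in zip(vocabulary, vector) if flag == 1]


if __name__ == "__main__":
    vocabulary = ["drug_a", "drug_b", "drug_c"]  # hypothetical drug codes
    target = multi_hot({"drug_a", "drug_c"}, vocabulary)
    print(target, decode(target, vocabulary))
```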
This study proposes an MR-KPA model to realize this task based on small-scale data. On the one hand, the proposed model adopts knowledge-enhanced pre-training. A large number of single-visit EMR data are used as the pre-training data to avoid splitting the limited longitudinal EMR data. The classification knowledge of diagnostic and drug codes is encoded as external domain features and then fused into the EMR embeddings. On the other hand, the model integrates adversarial training into a multilayer perceptron (MLP) to avoid over-fitting during the fine-tuning process.
The whole framework of MR-KPA is described in Fig. 2. It includes three modules: input representation, pre-training and prediction. The input representation module transforms each EMR record into the diagnosis code embedding, the drug code embedding and the symptom embedding. Based on these three types of embeddings, the pre-training module creates a pre-training visit model by performing two types of pre-training tasks. Finally, the prediction module fine-tunes the pre-training visit model and obtains the predicted drug code based on patient's multiple-visit records. The details will be described in the following subsections.

Input representation
The input representation module transforms each EMR into a group of multi-dimensional embeddings as the input of the subsequent module. As shown in Fig. 3, multiple-visit records are input into this module. Each record includes the columns SUBJECT ID, HADM ID, ICD-10, NDC, and SYM, which represent the patient ID, hospital admission ID, diagnostic code, drug code, and segmented symptom words respectively. They are transformed into two ontology embeddings and one dictionary embedding.
For the EMR of the $n$-th patient at the $t$-th visit, $P_t^{(n)}$, its input embedding can be obtained as follows.
Ontology embedding. Ontology embedding is adopted to realize domain knowledge-based external feature fusion. Two types of code ontology embeddings are constructed from the ICD-10 ontology $O_d$ and the NDC ontology $O_m$. Because medical codes in raw EMR data are leaf nodes in the code ontology trees, the code ontology embedding can be obtained by using a graph attention network (GAT) [8,10,12,13], which encodes the classification knowledge in the diagnostic and drug code trees as external domain features. For each medical code $c_* \in d_t^{(n)} \cup m_t^{(n)}$, an initial embedding $h_{c_*} \in \mathbb{R}^{d}$ is assigned, where $d$ is the embedding dimension, and then the following procedure is performed to obtain its ontology embedding $o_{c_*}$:

$$o_{c_*} = \big\Vert_{k=1}^{K} \, \sigma\Big(\sum_{j \in N_{c_*}} a^{k}_{c_*,j} W^{k} h_j\Big),$$

where $* \in \{d, m\}$, $N_{c_*} = \{c_*\} \cup pa(c_*)$ contains $c_*$ itself and its parent nodes, $\Vert$ represents concatenation, which enables the multi-head attention mechanism with $K$ heads, $\sigma$ is a nonlinear activation function, $W^{k} \in \mathbb{R}^{m \times d}$ is the weight matrix for input transformation, and $a^{k}_{c_*,j}$ is the corresponding $k$-th normalized attention coefficient.

Fig. 3 The framework of Input Representation. Both the diagnosis embedding and the medicine embedding adopt ontology embeddings based on code trees. The symptom embedding adopts the dictionary embedding.
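The attention-based aggregation over a code and its parent nodes can be sketched in plain Python. This is a toy illustration, not the authors' implementation: the attention score (a LeakyReLU over a learned vector applied to the concatenated transformed embeddings, as in the standard GAT formulation), the choice of tanh for the activation, and all names (`attention_head`, `ontology_embedding`, `H`, `heads`) are assumptions, since the text only states that normalized attention coefficients are used.

```python
import math


def softmax(scores):
    """Normalize raw attention scores so that they sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, h):
    """Multiply matrix W (list of rows) by vector h."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]


def attention_head(code, parents, H, W, a):
    """One head: aggregate the transformed embeddings of N_c = {c} ∪ pa(c)."""
    neighbours = [code] + list(parents)
    z = {j: matvec(W, H[j]) for j in neighbours}
    raw = []
    for j in neighbours:
        score = sum(ai * x for ai, x in zip(a, z[code] + z[j]))
        raw.append(score if score > 0 else 0.2 * score)  # LeakyReLU (assumed)
    alpha = softmax(raw)
    aggregated = [
        sum(al * z[j][i] for al, j in zip(alpha, neighbours))
        for i in range(len(W))
    ]
    return [math.tanh(x) for x in aggregated], alpha  # tanh as sigma (assumed)


def ontology_embedding(code, parents, H, heads):
    """Concatenate the outputs of all heads, given as (W, a) pairs."""
    output = []
    for W, a in heads:
        head_output, _ = attention_head(code, parents, H, W, a)
        output.extend(head_output)
    return output


if __name__ == "__main__":
    H = {"J98.4": [1.0, 0.0], "J98": [0.0, 1.0]}  # toy initial embeddings
    identity = [[1.0, 0.0], [0.0, 1.0]]
    heads = [(identity, [0.1, 0.0, 0.0, 0.2]), (identity, [0.0] * 4)]
    print(ontology_embedding("J98.4", ["J98"], H, heads))
```

With two heads of output size two, the resulting embedding has four components, which is the effect of the concatenation operator in the formula.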
Dictionary embedding. Dictionary embedding is constructed from a symptom dictionary $D_s$, which contains all symptoms in the EMR data. For each symptom $s_i \in s_t^{(n)}$, its dictionary embedding $d_{s_i}$ is simply its index value in $D_s$.
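A minimal sketch of the dictionary embedding is given below. It assumes that indices are assigned in order of first appearance, with symptoms inside one record sorted for determinism; the text does not specify the ordering, and the function names are hypothetical.

```python
def build_symptom_dictionary(symptom_sets):
    """Assign each distinct symptom an integer index (order is an assumption)."""
    dictionary = {}
    for symptoms in symptom_sets:
        for symptom in sorted(symptoms):
            dictionary.setdefault(symptom, len(dictionary))
    return dictionary


def dictionary_embedding(symptoms, dictionary):
    """Map the symptoms of one visit to their index values in the dictionary."""
    return [dictionary[symptom] for symptom in sorted(symptoms)]


if __name__ == "__main__":
    D_s = build_symptom_dictionary([{"fever", "cough"}, {"fever", "headache"}])
    print(D_s)
    print(dictionary_embedding({"headache", "fever"}, D_s))
```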

Pre-training
The pre-training module creates a pre-training visit model based on the input embedding transformed from single-visit records of EMR. By pre-training, a large number of single visit data are effectively used to mine the richer internal features of EMR.
Before pre-training, a multi-layer Transformer architecture [50] is adopted to derive the visit embeddings from the two ontology embeddings and the one dictionary embedding of each EMR. For $P_t^{(n)}$, three types of visit embedding can be obtained as follows:

$$v_d^t = \mathrm{Transformer}\big(\{[CLS]\} \cup \{o_{c_d} \mid c_d \in d_t^{(n)}\}\big)[0],$$
$$v_m^t = \mathrm{Transformer}\big(\{[CLS]\} \cup \{o_{c_m} \mid c_m \in m_t^{(n)}\}\big)[0],$$
$$v_s^t = \mathrm{Transformer}\big(\{[CLS]\} \cup \{d_{s_i} \mid s_i \in s_t^{(n)}\}\big)[0],$$

where $v_d^t$ is the diagnostic visit embedding, $v_m^t$ is the drug visit embedding, $v_s^t$ is the symptom visit embedding, and [CLS] is the first tag of each sequence, whose final hidden state (position 0) is used as the aggregate sequence representation for classification, enabling BERT to better handle various downstream tasks. In order to obtain a consistent input length, the token sequences are aligned by padding. This paper conducts the following two kinds of pre-training tasks so that the visit embeddings absorb enough information for medication recommendation.
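The input preparation described above, prepending [CLS] and padding to a common length, can be sketched as follows. The `[PAD]` token name, the attention mask and the function name `build_input` are assumptions borrowed from common BERT practice rather than details given in the text.

```python
CLS = "[CLS]"
PAD = "[PAD]"  # padding token name is an assumption


def build_input(tokens, max_len):
    """Prepend [CLS], pad to max_len, and return (sequence, attention mask)."""
    sequence = [CLS] + list(tokens)
    if len(sequence) > max_len:
        raise ValueError("sequence longer than max_len")
    padding = max_len - len(sequence)
    mask = [1] * len(sequence) + [0] * padding
    return sequence + [PAD] * padding, mask


if __name__ == "__main__":
    sequence, mask = build_input(["J98.4", "J98.1"], max_len=5)
    print(sequence)
    print(mask)
```

The hidden state at position 0 of the Transformer output, which corresponds to [CLS], would then serve as the visit embedding.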
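Mask EMR Field Task (Mask EF Task). This task randomly masks some of the embeddings to better represent information about the composition of EMR records. By changing the word token masking of sentences [51] into field masking of EMR records, the following loss function, in the standard binary cross-entropy form, is calculated:

$$L_s = -\sum_{c_* \in C_*^{(n)}} \log p(c_* \mid v_*) - \sum_{c_* \in C_* \setminus C_*^{(n)}} \log\big(1 - p(c_* \mid v_*)\big),$$

where $C^{(n)}$ is the union set of medical codes and symptoms of the $n$-th patient, $c_* \in C_*^{(n)}$ denotes a medical code or symptom involved in the $n$-th patient, $c_* \in C_* \setminus C_*^{(n)}$ denotes the medical codes or symptoms not used for the $n$-th patient, $p(c_* \mid v_*)$ is the probability predicted for $c_*$ from the visit embedding $v_*$, and $* \in \{d, m, s\}$. We minimize the binary cross entropy loss $L_s$ to give the model a stronger self-prediction ability.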
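A minimal sketch of the two ingredients of this task, random field masking and a binary cross-entropy loss over the code vocabulary, is given below. The masking probability, the `[MASK]` token and the function names are assumptions for illustration; the text does not state the masking rate used in the experiments.

```python
import math
import random


def mask_fields(tokens, mask_prob, rng, mask_token="[MASK]"):
    """Replace each token by the mask token with probability mask_prob."""
    return [mask_token if rng.random() < mask_prob else t for t in tokens]


def binary_cross_entropy(probabilities, targets, eps=1e-12):
    """Sum of -[y log p + (1 - y) log(1 - p)] over all codes."""
    loss = 0.0
    for p, y in zip(probabilities, targets):
        p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss


if __name__ == "__main__":
    rng = random.Random(0)
    print(mask_fields(["J98.4", "J98.1", "fever"], 0.15, rng))
    # targets mark which codes actually occur in the record
    print(binary_cross_entropy([0.9, 0.2, 0.7], [1, 0, 1]))
```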
Correlation Prediction Task (CorP Task). This task is used to represent information about the correlation among diagnostic codes, drug codes and symptoms. In BERT, the next sentence prediction (NSP) task facilitates the prediction of sentence relations. G-Bert revised the NSP task into a multidirectional prediction task for predicting the unknown disease or drug codes of a sequence [16]. This paper revises the NSP task [52] into the CorP Task. For the mutual prediction of diagnostic codes, drug codes and symptoms, three loss functions are calculated. Finally, the pre-training optimization objective is simply the combination of the aforementioned losses.

Prediction
An MLP module with adversarial training is used to achieve the final prediction task. Based on the pre-training model, multi-visit EMR sequences can be transformed into three types of visit embedding sequences. The MLP [53] predicts the recommended drug codes at the $t$-th visit from the concatenation of the averages of the diagnostic, drug and symptom visit embeddings before the $t$-th visit with the diagnostic visit embedding and symptom visit embedding at the $t$-th visit, using a learnable transformation matrix $W \in \mathbb{R}^{|C_m| \times 3l}$.
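The construction of the MLP input can be sketched as follows. The sketch simply concatenates the five blocks named in the text, so its output has five blocks of length $l$; it does not attempt to reproduce the stated shape of $W$, and the function names and the zero-based visit index are assumptions.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def build_mlp_input(diag, drug, symptom, t):
    """Concatenate the averaged history with the current visit.

    diag, drug, symptom: lists of visit embeddings, one per visit.
    t: zero-based index of the current visit (t >= 1, i.e. the second
       visit or later, so that some history exists).
    """
    if t < 1:
        raise ValueError("prediction starts from the second visit")
    return (
        mean_vector(diag[:t])
        + mean_vector(drug[:t])
        + mean_vector(symptom[:t])
        + list(diag[t])
        + list(symptom[t])
    )


if __name__ == "__main__":
    diag = [[1.0], [3.0], [5.0]]
    drug = [[2.0], [4.0], [6.0]]
    symptom = [[0.0], [1.0], [2.0]]
    print(build_mlp_input(diag, drug, symptom, t=2))
```

Note that the drug embedding of the current visit is deliberately left out, since the drugs of that visit are what the model has to predict.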
Therefore, the loss function can be calculated as follows:

$$L = -\frac{1}{T-1}\sum_{t=2}^{T}\Big(\hat{y}_t^{\top}\log(y_t) + (1-\hat{y}_t)^{\top}\log(1-y_t)\Big) \tag{11}$$

where $y$ is the predicted value sequence, $\hat{y}$ is the true value sequence and $T$ is the number of visits of the patient. In this formula, $t = 2$ means that the prediction starts from the second visit of the patient. The reason is that this paper focuses on longitudinal sequential medication recommendation, which predicts the drugs currently suitable for the patient based on the patient's historical and current diagnoses and symptoms.

Lin et al. BMC Bioinformatics (2022) 23:552
In order to avoid over-fitting, this paper integrates the adversarial training method FGM into the deep prediction model [54]. Adversarial training can not only improve the defense ability of the model against adversarial samples, but also improve its generalization on the original samples. For the prediction task, the disturbances $r_{adv-d}$ and $r_{adv-m}$ are added to the diagnostic ontology embedding and the drug ontology embedding respectively, in order to make the model err as much as possible and thereby increase its robustness. Referring to [54], the disturbances can be calculated as follows:

$$r_{adv-d} = \epsilon \frac{g_d}{\lVert g_d \rVert_2}, \qquad r_{adv-m} = \epsilon \frac{g_m}{\lVert g_m \rVert_2},$$

where $\epsilon$ is a constant, and $g_d$ and $g_m$ are the gradients of the loss with respect to $v_d^t$ and $v_m^t$, so that $r_{adv-d}$ and $r_{adv-m}$ are the normalized gradients scaled by $\epsilon$. The drug sequence $y_t$ is predicted from the disturbed $v_d^{\tau'}$ and $v_m^{\tau'}$ and combined with the real drug sequence $\hat{y}_t$ to construct a loss function. In back propagation, the gradient of adversarial training is accumulated on top of the normal gradient. Then the original values of $v_d^{\tau}$ and $v_m^{\tau}$ are restored. Finally, the parameters are updated according to the accumulated gradient. The loss function after adversarial training is defined in the same way as Eq. (11), where $y_t$ is calculated from the disturbed diagnostic ontology embedding and drug ontology embedding on the basis of Eq. (13).
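The FGM procedure described above (perturb along the normalized gradient, accumulate the adversarial gradient, then restore the original values) can be illustrated with a toy sketch in plain Python. Here the gradient is supplied by a caller-provided function `grad_fn`, which is a hypothetical stand-in for back propagation, and the gradient is accumulated with respect to the embedding itself rather than over all model parameters as in a real training loop.

```python
import math


def fgm_perturbation(gradient, epsilon=1.0):
    """r_adv = epsilon * g / ||g||_2, or zeros when the gradient vanishes."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm == 0.0:
        return [0.0] * len(gradient)
    return [epsilon * g / norm for g in gradient]


def accumulated_gradient(embedding, grad_fn, epsilon=1.0):
    """Normal gradient plus the gradient at the perturbed embedding.

    The input embedding is never modified in place, which plays the role
    of the 'restore the original values' step in the text.
    """
    normal = grad_fn(embedding)
    r_adv = fgm_perturbation(normal, epsilon)
    perturbed = [e + r for e, r in zip(embedding, r_adv)]
    adversarial = grad_fn(perturbed)
    return [a + b for a, b in zip(normal, adversarial)]


if __name__ == "__main__":
    # Toy loss ||v||^2 whose gradient is 2v (an assumption for illustration).
    toy_grad = lambda v: [2.0 * x for x in v]
    print(fgm_perturbation([3.0, 4.0], epsilon=0.5))
    print(accumulated_gradient([3.0, 4.0], toy_grad, epsilon=1.0))
```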

Baselines
We compared the proposed MR-KPA with the following baseline methods. All methods were developed under PyTorch and run on an Nvidia Quadro P2000:
• Learn to Prescribe (LEAP) [34]: LEAP is an instance-based model that aims to prescribe effective and safe drug combinations for patients with recurrent diseases. It uses a recurrent decoder to model label dependencies and content-based attention to capture the mapping between labels and instances, decomposing treatment recommendation into a sequential decision-making process while automatically determining the appropriate number of drugs. The number of epochs for this model is set to 30.
• Logistic Regression (LR) [55]: This study adopts a logistic regression model with L1/L2 regularization as a baseline method. We represented sequential multiple medical codes by summing up the multi-hot vectors of each visit.
• Reverse Time Attention Model (RETAIN) [17]: RETAIN is a medication recommendation model based on two-level neural attention that examines influential past visits and important clinical variables, such as critical diagnoses, within those visits. In this study, the number of epochs is set to 30, which gave the best performance in our experiments. A drug is recommended when the model predicts that its probability of being recommended is greater than 30%.
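The decision rule mentioned for RETAIN, recommending a drug when its predicted probability exceeds 30%, can be sketched as follows. The drug names and the function name `recommend` are hypothetical.

```python
def recommend(probabilities, threshold=0.3):
    """Return the drugs whose predicted probability is strictly above threshold."""
    return {drug for drug, p in probabilities.items() if p > threshold}


if __name__ == "__main__":
    predicted = {"drug_a": 0.82, "drug_b": 0.30, "drug_c": 0.05}
    print(sorted(recommend(predicted)))
```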

Metrics
This paper uses the Jaccard Similarity Coefficient [56] and average F1 [57] to measure experimental results. They can be calculated as follows:

$$\mathrm{Jaccard} = \frac{1}{\sum_{k}^{N}\sum_{t}^{T_k} 1} \sum_{k}^{N}\sum_{t}^{T_k} \frac{\lvert Y_t^{(k)} \cap \hat{Y}_t^{(k)} \rvert}{\lvert Y_t^{(k)} \cup \hat{Y}_t^{(k)} \rvert},$$

where $\hat{Y}_t^{(k)}$ is the predicted set and $Y_t^{(k)}$ is the ground truth set, and

$$\mathrm{F1} = \frac{1}{\sum_{k}^{N}\sum_{t}^{T_k} 1} \sum_{k}^{N}\sum_{t}^{T_k} \frac{2 \, P_t^{(k)} R_t^{(k)}}{P_t^{(k)} + R_t^{(k)}},$$

where $P_t^{(k)}$ is the precision rate, $R_t^{(k)}$ is the recall rate, $N$ is the number of patients in the test set and $T_k$ is the number of visits of the $k$-th patient. We also use Precision Recall AUC (PR-AUC) to evaluate the performance of the algorithm.

Table 3 shows the performance results of the different models. LEAP is clearly less effective than the other baseline models and the proposed MR-KPA. As an instance-based medication recommendation model, LEAP does not take longitudinal EMR data into account. Therefore, this result indicates that it is necessary to adopt the longitudinal sequential method, namely medication recommendation based on longitudinal EMR data, in this study. LR is a shallow machine learning model that is widely used in medication recommendation. RETAIN is a medication recommendation model based on a deep neural network. Comparing their results, the Jaccard score and PR-AUC score of LR are significantly higher than those of RETAIN. This indicates that deep learning models are no better than traditional shallow machine learning models on small-scale longitudinal EMR data. Therefore, it is also necessary to adopt the knowledge-enhanced pre-training visit model to realize few-shot learning in this study. Finally, the proposed MR-KPA obtains the best results on all evaluating indicators. This shows that the proposed model can effectively improve the accuracy of medication recommendation based on small-scale longitudinal EMR data.
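The Jaccard and F1 metrics used in this section can be computed per visit and then averaged over all evaluated visits, as sketched below. Returning 0 when both sets are empty or when there is no overlap is a convention chosen for this sketch, and the function names are hypothetical.

```python
def jaccard(predicted, truth):
    """|intersection| / |union| of the predicted and ground truth drug sets."""
    union = predicted | truth
    return len(predicted & truth) / len(union) if union else 0.0


def f1(predicted, truth):
    """Harmonic mean of precision and recall for one visit."""
    hits = len(predicted & truth)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(truth)
    return 2 * precision * recall / (precision + recall)


def average_over_visits(metric, predictions, truths):
    """Average a per-visit metric over all visits of all patients.

    predictions, truths: one list per patient, holding one set per visit.
    """
    scores = [
        metric(p, t)
        for patient_p, patient_t in zip(predictions, truths)
        for p, t in zip(patient_p, patient_t)
    ]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    predictions = [[{"a", "b"}], [{"x"}]]
    truths = [[{"b", "c"}], [{"x"}]]
    print(average_over_visits(jaccard, predictions, truths))
    print(average_over_visits(f1, predictions, truths))
```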

Discussion
Knowledge enhancement based on ontology embedding, the pre-training visit model and adversarial training are the three core optimizations in this paper. This section discusses their effectiveness through an ablation study. Seven MR-KPA variants are designed, each removing one or more of the three optimizations, as indicated by the subscripts K−, P− and A−. Table 4 gives the experimental results of the ablation study. Compared with the baseline models, the result of MR-KPA$_{K-,P-,A-}$ is similar to that of RETAIN. Its three evaluating indicators are significantly higher than those of LEAP and two of its evaluating indicators are lower than those of LR. This once again proves the necessity of adopting longitudinal sequential medication recommendation and the shortcomings of deep learning models in medication recommendation based on small-scale longitudinal EMR data. Referring to [54], this section further discusses the training effects of the three optimizations through an analysis of the training loss curves. Figure 4 gives the learning curves.

Table 4 Experimental results of the ablation study

Figure 4a compares the training loss of MR-KPA and MR-KPA$_{K-}$. The comparison indicates that knowledge enhancement based on ontology embedding affects the training speed in the early stage, but has little impact on the recommendation results of the whole model. This is consistent with the results in Table 3. MR-KPA$_{K-}$ has the closest result to MR-KPA, which indicates that knowledge enhancement based on ontology embedding brings the smallest improvement to the EMR-based medication recommendation task.
Figure 4b gives the comparison of training loss between MR-KPA and MR-KPA$_{P-}$. As the number of iterations increases, the loss of MR-KPA gradually decreases, but the loss of MR-KPA$_{P-}$ does not change noticeably. This indicates that the pre-training visit model is the key to ensuring the convergence of the model on relatively small-scale longitudinal EMR data. It has a significant effect on improving medication recommendation based on small-scale longitudinal EMR data. This is also consistent with the results in Table 3: among MR-KPA$_{K-}$, MR-KPA$_{P-}$ and MR-KPA$_{A-}$, MR-KPA$_{P-}$ has the worst results.
Figure 4c gives the comparison of training loss between MR-KPA and MR-KPA$_{A-}$. As the number of iterations increases, the loss of MR-KPA$_{A-}$ declines much more slowly than that of MR-KPA, and in the later stage the loss values of MR-KPA are always below those of MR-KPA$_{A-}$. This indicates that adversarial training plays a role in preventing the model from over-fitting on small-scale longitudinal EMR data. Therefore, it can effectively improve medication recommendation based on small-scale longitudinal EMR data, as shown in Table 3.

All three optimizations are effective and compatible
Comparing Fig. 4a-c, only the loss curve of MR-KPA$_{P-}$ decreases slowly, and it even shows an upward trend in the later period, indicating that the model does not converge. Therefore, consistent with Table 3, the pre-training visit model is the most effective optimization in this study.

Limitations of this study
There are still some limitations in this study. Due to the addition of adversarial training, the computational complexity of the proposed MR-KPA inevitably increases, and so does the running time. However, because the training data are small-scale, this limitation of the recommendation model is partly compensated. Another limitation of this study is that the temporal features of longitudinal data are not fully utilized. Therefore, an important direction for future work is to effectively mine temporal features with various deep neural networks, such as linear networks.

Conclusion
In this paper, we propose a new EMR-based medication recommendation model called MR-KPA. By combining knowledge-enhanced pre-training with the deep adversarial network, MR-KPA improves both feature representation and the fine-tuning process to effectively realize medication recommendation based on small-scale EMR data. To the best of our knowledge, MR-KPA is the first model that integrates the currently popular graph neural network, pre-training and adversarial training for EMR-based medication recommendation. The ablation experiments and comparative experiments show that these three technologies are complementary and that their integration enables the proposed MR-KPA model to effectively realize medication recommendation on small-scale longitudinal EMR data. By reducing the dependence on high-quality labelled data, this study can greatly reduce the time and economic costs required for model construction, and help to promote the comprehensive application of EMR-based medication recommendation.

Author Contributions
Conceptualization, SL and JC; methodology, SL, MW and JC; software, MW and JC; validation, SL and JC; investigation, MW and CS; data curation, MW; writing-original draft preparation, MW; writing-review and editing, JC, MW, SL, ZX, LC, and QG; visualization, MW. All authors read and approved the final manuscript.

Funding
This study was supported by National Key Research and Development Program of China (Grant No. 2020YFB2104402) and Beijing Natural Science Foundation (No. 4222022).

Availability of data and materials
The datasets generated and analysed during the current study are not publicly available due to privacy restrictions from hospitals but are available from the corresponding author on reasonable request. The source codes are publicly available in the GitHub repository, https://github.com/MengzhenWangmz/MR-KPA.

Declarations
Ethics approval and consent to participate